Abstract: Overlap is one of the characteristics of social
networks, in which a person may belong to more than one social 
group. For this reason, discovering overlapping structures 
is necessary for realistic social analysis. In this paper, we 
present a novel, general framework to detect and analyze 
both individual overlapping nodes and entire communities. In 
this framework, nodes exchange labels according to dynamic 
interaction rules. A specific implementation called the
Speaker-listener Label Propagation Algorithm (SLPA)
demonstrates excellent performance in identifying both
overlapping nodes and overlapping communities with different
degrees of diversity.

Keywords: social network; overlapping community detection;
label propagation; dynamic interaction; algorithm; 

I. Introduction 

Modular structure is considered to be the building block of 
real-world networks as it often accounts for the functionality 
of the system. It has been well understood that people in 
a social network are naturally characterized by multiple 
community memberships. For example, a person usually has 
connections to several social groups like family, friends and 
colleagues; a researcher may be active in several areas; on
the Internet, a person can simultaneously subscribe to an 
arbitrary number of groups. 

For this reason, overlapping community detection algo- 
rithms have been investigated. These algorithms aim to 
discover a cover (T), which is defined as a set of clusters in 
which each node belongs to at least one cluster. In this paper, 
we propose an efficient algorithm to identify both individual 
overlapping nodes and the entire overlapping communities 
using the underlying network structure alone. 

II. Related Work 

Detection of overlapping communities was previously
proposed by Palla [2] with the clique percolation
algorithm (CPM). CPM is based on the assumption that
a community consists of fully connected subgraphs and
detects overlapping communities by searching each such
subgraph for adjacent cliques that share with it at least
a certain number of nodes. CPM is suitable for networks with
densely connected parts.

1 SLPA 1.0: https://sites.google.com/site/communitydetectionslpa/

Another line of research is based on maximizing a local
benefit function. Baumes [3] proposed the iterative scan
algorithm (IS). IS expands seeded small cluster cores by
adding or removing nodes until the local density function
cannot be improved. The quality of the discovered
communities depends on the quality of the seeds. LFM [4]
expands a community from a random seed node until the fitness
function is locally maximal. LFM depends significantly on
a parameter of the fitness function that controls the size of
the communities.

CONGA [5] extends Girvan and Newman's divisive
clustering algorithm by allowing a node to split into multiple
copies. Both the splitting betweenness, defined by the
number of shortest paths on the imaginary edge, and the usual
edge betweenness are considered. In the refined version,
CONGO [6], local betweenness is used to optimize the
speed.

Copra [7] is an extension of the label propagation
algorithm [8] for overlapping community detection. Each
node updates its belonging coefficients by averaging the
coefficients over all its neighbors. Copra produces a number
of small communities in some networks.

EAGLE [9] uses the agglomerative framework to produce
a dendrogram. All maximal cliques that serve as initial com- 
munities are first computed. Then, the pair of communities 
with maximum similarity is merged iteratively. Expensive 
computation is one drawback of this algorithm. 

Fuzzy clustering has also been extended to overlapping
community detection. Zhang [10] used the spectral method
to embed the graph into a (k-1)-dimensional Euclidean space.
Nodes are then clustered by the fuzzy c-means algorithm.
Nepusz [11] modeled overlapping community detection
as a nonlinear constrained optimization problem. Psorakis et
al. [12] proposed a model based on Bayesian nonnegative
matrix factorization (NMF).

The idea of partitioning links instead of nodes to discover
community structure has also been explored [13], [14]. As
a result, the node partition of a line (or link) graph leads to
an edge partition of the original graph.

III. SLPA: Speaker-listener Label Propagation Algorithm

The algorithm proposed in this paper is an extension of 
the Label Propagation Algorithm (LPA) [8]. In LPA, each
node holds only a single label that is iteratively updated by 
adopting the majority label in the neighborhood. Disjoint 
communities are discovered when the algorithm converges. 
One way to account for overlap is to allow each node to 
possess multiple labels as proposed in [15]. Our algorithm
follows this idea but applies different dynamics with more 
general features. 

In the dynamic process, we need to determine 1) how to 
spread nodes' information to others; 2) how to process the 
information received from others. The critical issue related 
to both questions is how information should be maintained. 
We propose a speaker-listener based information propagation 
process (SLPA) to mimic human communication behavior. 

In SLPA, each node can be a listener or a speaker. The 
roles are switched depending on whether a node serves as 
an information provider or information consumer. Typically, 
a node can hold as many labels as it likes, depending on 
what it has experienced in the stochastic processes driven 
by the underlying network structure. A node accumulates 
knowledge of repeatedly observed labels instead of erasing 
all but one of them. Moreover, the more a node observes a 
label, the more likely it will spread this label to other nodes 
(mimicking people's preference of spreading most frequently 
discussed opinions). 

In a nutshell, SLPA consists of the following three stages
(see Algorithm 1 for pseudo-code):

1) First, the memory of each node is initialized with 
this node's id (i.e., with a unique label). 

2) Then, the following steps are repeated until the stop 
criterion is satisfied: 

a. One node is selected as a listener. 

b. Each neighbor of the selected node sends
out a single label following a certain
speaking rule, such as selecting a random
label from its memory with probability
proportional to the occurrence frequency of
this label in the memory.

c. The listener accepts one label from the
collection of labels received from
neighbors following a certain listening
rule, such as selecting the most popular
label from what it observed in the current
step.

3) Finally, the post-processing based on the labels in 
the memories of nodes is applied to output the 
communities. 

SLPA utilizes an asynchronous update scheme, i.e., when
updating a listener's memory at time t, some already updated
neighbors have memories of size t and some other neighbors
still have memories of size t - 1. SLPA reduces to LPA when
the size of memory is limited to one and the stop criterion
is convergence of all labels.

Algorithm 1: SLPA(T, r)
[n, Nodes] = loadnetwork();
Stage 1: initialization
for i = 1 : n do
    Nodes(i).Mem = i;
Stage 2: evolution
for t = 1 : T do
    Nodes.ShuffleOrder();
    for i = 1 : n do
        Listener = Nodes(i);
        Speakers = Nodes(i).getNbs();
        for j = 1 : Speakers.len do
            LabelList(j) = Speakers(j).speakerRule();
        w = Listener.listenerRule(LabelList);
        Listener.Mem.add(w);
Stage 3: post-processing
for i = 1 : n do
    remove Nodes(i) labels seen with probability < r;
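As a rough illustration of Stages 1 and 2, the following Python sketch gives one possible reading of the evolution process. It is not the released SLPA 1.0 code: the function name slpa_memories, the adjacency-dictionary input and the random tie-breaking in the listening rule are assumptions made for this example.

```python
import random
from collections import Counter


def slpa_memories(adjacency, num_iterations=100, seed=None):
    """Run SLPA Stages 1 and 2 and return each node's label memory.

    adjacency: dict mapping node -> iterable of neighbors (undirected graph).
    This input format is an assumption of the sketch.
    """
    rng = random.Random(seed)

    # Stage 1: every memory starts with the node's own id as a unique label.
    memory = {node: [node] for node in adjacency}
    nodes = list(adjacency)

    # Stage 2: T sweeps; each node acts as the listener once per sweep.
    for _ in range(num_iterations):
        rng.shuffle(nodes)
        for listener in nodes:
            speakers = list(adjacency[listener])
            if not speakers:
                continue  # isolated node: there is nothing to listen to

            # Speaking rule: a uniform draw from the memory list picks a label
            # with probability proportional to its frequency in the memory.
            heard = [rng.choice(memory[s]) for s in speakers]

            # Listening rule: most popular label among those heard in this
            # step (random tie-breaking is an assumption of this sketch).
            counts = Counter(heard)
            top = max(counts.values())
            winner = rng.choice([lab for lab, c in counts.items() if c == top])

            # Asynchronous update: later listeners in the same sweep already
            # see this new label.
            memory[listener].append(winner)
    return memory
```

Storing the memory as a plain list with repeated labels keeps the speaking rule a constant-time operation, at the cost of a memory that grows by one entry per sweep.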

It is worth noting that each node in our system has a
memory and takes into account information that has been
observed in the past to make the current decision. This
feature is typically absent in other label propagation
algorithms such as [8], [16], where a node updates its label
and completely forgets the old knowledge. This feature allows
us to combine the accuracy of the asynchronous update with
the stability of the synchronous update [8]. As a result,
the fragmentation issue of producing a number of small
communities, observed in Copra in some networks, is
avoided.

A. Stop Criterion 

The original LPA stop criterion of having every node 
assigned the most popular label in its neighborhood does 
not apply to the multiple labels case. Since neither the 
case where the algorithm reaches a single community (i.e., 
a special convergence state) nor the oscillation (e.g., on a
bipartite network) would affect the stability of SLPA, we can
stop at any time as long as we collect sufficient information 
for post-processing. In the current implementation, SLPA 
simply stops when the predefined maximum number of 
iterations T is reached. In general, SLPA produces relatively 
stable outputs, independent of network size or structure, 
when T is greater than 20. 

B. Post-processing and Community Detection 

SLPA collects only label information that reflects the
underlying network structure during the evolution. The
detection of communities is performed when the stored
information is post-processed. Given the memory of a node,
SLPA converts it into a probability distribution of labels.
This distribution defines the strength of association to the
communities to which the node belongs. This distribution can
be used for fuzzy community detection [11]. More often than
not, one would like to produce crisp communities in which
the membership of a node to a given community is binary,
i.e., either a node is in a community or not. To this end, a
simple thresholding procedure is performed. If the
probability of seeing a particular label during the whole
process is less than a given threshold r ∈ [0, 1], this label
is deleted from a node's memory. After thresholding, connected
nodes having a particular label are grouped together and form
a community. If a node contains multiple labels, it belongs
to more than one community and is therefore called an
overlapping node. In SLPA, we remove nested communities,
so the final communities are maximal.

Figure 1. The convergence behavior of the parameter r in LFR
benchmarks with n = 5000. The y-axis is the ratio of the
numbers of detected to true communities.

Figure 2. The execution time in seconds of SLPA in LFR
benchmarks with k = 20. n ranges from 1000 to 50000.
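The post-processing described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name extract_communities is hypothetical, and "connected nodes having a particular label" is read here as the connected components of the subgraph induced by the nodes that keep that label, which the text does not spell out.

```python
from collections import Counter


def extract_communities(memory, adjacency, r):
    """Threshold label memories and group nodes into crisp communities.

    memory: dict node -> list of observed labels (with repetitions).
    adjacency: dict node -> iterable of neighbors.
    r: threshold in [0, 1]; labels with relative frequency < r are dropped.
    """
    # Thresholding: keep a label for a node only if its frequency is >= r.
    members = {}
    for node, labels in memory.items():
        total = len(labels)
        for label, count in Counter(labels).items():
            if count / total >= r:
                members.setdefault(label, set()).add(node)

    # Grouping (assumed reading): split each label's node set into
    # connected components using a depth-first search.
    communities = []
    for nodes in members.values():
        unseen = set(nodes)
        while unseen:
            start = unseen.pop()
            component, stack = {start}, [start]
            while stack:
                u = stack.pop()
                for v in adjacency[u]:
                    if v in unseen:
                        unseen.remove(v)
                        component.add(v)
                        stack.append(v)
            communities.append(frozenset(component))

    # Remove duplicates and nested communities so that the result is maximal.
    unique = set(communities)
    return [c for c in unique if not any(c < other for other in unique)]
```

A node that appears in more than one returned set is an overlapping node. Because r is applied only here, the same memories can be re-thresholded with different values of r without rerunning the evolution.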

As shown in Fig. 1, SLPA converges (i.e., produces
similar output) quickly as the parameter r varies. The
effective range is typically narrow. Note that the threshold is
used only in the post-processing. This means that the dynamics
of SLPA are completely determined by the network structure
and the interaction rules. The number of memberships is
constrained only by the node degree. In contrast, Copra uses
a parameter to control the maximum number of memberships
granted during the iterations.

C. Complexity 

The initialization of labels requires O(n), where n is
the total number of nodes. The outer loop is controlled
by the user-defined maximum iteration T, which is a small
constant (see footnote 2). The inner loop is controlled by n.
Each operation of the inner loop executes one speaking rule
and one listening rule. For the speaking rule, selecting a
label from the memory proportionally to the frequencies is, in
principle, equivalent to randomly selecting an element from
the array, which is an O(1) operation. For the listening rule,
since the listener needs to check all the labels from its
neighbors, it takes O(K) on average, where K is the average
degree. The complexity of the dynamic evolution (i.e., stages
1 and 2) for the asynchronous update is O(Tm) on an arbitrary
network and O(Tn) on a sparse network, where m is the total
number of edges. In the post-processing, the thresholding
operation requires O(Tn) operations since each node has a
memory of size T.

2 In our experiments, we used T set to 100.

Therefore, the time complexity of the entire algorithm is
O(Tn) in sparse networks. For a naive implementation, the
execution time on synthetic networks (see Section IV) scales
slightly faster than a linear growth with n, as shown in
Fig. 2.
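The counting argument behind these bounds can be written out as a short derivation. The symbol k_i for the degree of node i is introduced here for illustration only; T, n, m and K are as defined in the text.

```latex
\begin{align*}
\text{one sweep over all listeners:}\quad
  & \sum_{i=1}^{n} O(k_i) = O\!\Big(\sum_{i=1}^{n} k_i\Big) = O(2m) = O(m), \\
\text{evolution with } T \text{ sweeps:}\quad
  & O(Tm), \\
\text{sparse network } (m = nK/2,\ K = O(1)):\quad
  & O(Tm) = O(TnK) = O(Tn), \\
\text{thresholding } n \text{ memories of size } O(T):\quad
  & O(Tn).
\end{align*}
```

The first line uses the fact that each listener with degree k_i triggers k_i constant-time speaking rules and one listening rule over k_i labels, and that the degrees sum to 2m.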

IV. Experiments and Results 
A. Benchmark Networks 

To study the behavior of SLPA for overlapping community
detection, we conducted extensive experiments on
both synthetic and real-world networks. Table I lists the
classical social networks for our tests and their statistics
(see footnote 3). For synthetic networks, we adopted the LFR
benchmark [18], which is a special case of the planted
l-partition model, but characterized by heterogeneous
distributions of node degrees and community sizes.

In our experiments, we used networks with size n = 5000.
The average degree is kept at k = 10, which is of the same
order as most of the real-world networks we tested. The rest
of the parameters are as follows: node degrees and community
sizes are governed by power laws with exponents 2 and 1,
respectively; the maximum degree is 50; the community size
varies between 20 and 100; the mixing parameter μ, which is
the expected fraction of links of a node connecting it to
other communities, varies from 0.1 to 0.3.

The degree of overlapping is determined by parameters
O_n (i.e., the number of overlapping nodes) and O_m (i.e.,
the number of communities to which each overlapping node
belongs). We fixed the former to be 10% of the total number
of nodes. The latter, the most important parameter for our
test, varies from 2 to 8, indicating the diversity of
overlapping nodes. By increasing the value of O_m, we create
harder detection tasks.

3 Data are available at http://www-personal.umich.edu/~mejn/netdata/
and http://deim.urv.cat/~aarenas/data/welcome.htm

Figure 3. F-score for networks with n = 5000, k = 10, μ = 0.1.

Figure 4. F-score for networks with n = 5000, k = 10, μ = 0.3.

Figure 5. NMI for networks with n = 5000, k = 10, μ = 0.1.

Figure 6. Ratio of the detected to the known numbers of
memberships for networks with n = 5000, k = 10, μ = 0.1.
Values over 1 are possible when more memberships are detected
than there are known to exist.

Figure 7. Ratio of the detected to the known numbers of
communities for networks with n = 5000, k = 10, μ = 0.1.
Values over 1 are possible when more communities are detected
than there are known to exist.

Legend (Figures 5 to 7): SLPA r = 0.05; Copra v = 8;
CFinder k = 3; LFM α = 1.

Figure 8. NMI for networks with n = 5000, k = 10, μ = 0.3.

Figure 9. Ratio of the detected to the known numbers of
memberships for networks with n = 5000, k = 10, μ = 0.3.
Values over 1 are possible when more memberships are detected
than there are known to exist.

Figure 10. Ratio of the detected to the known numbers of
communities for networks with n = 5000, k = 10, μ = 0.3.
Values over 1 are possible when more communities are detected
than there are known to exist.

Legend (Figures 8 to 10): SLPA r = 0.1; Copra v = 8;
CFinder k = 3; LFM α = 1.

We compared SLPA with three well-known algorithms:
CFinder (the implementation of the clique percolation
algorithm [2]), Copra (another label propagation
algorithm), and LFM [4] (an algorithm expanding communities
based on a fitness function). Parameters for those algorithms
were set as follows. For CFinder, k varied from 3 to 10;
for Copra, v varied from 1 to 10; for LFM, α was set to
1.0, which was reported to give good results. For SLPA, the
maximum number of iterations T was set to 100 and r varied
from 0.01 to 0.1 to determine its optimal value. The average
performances over ten repetitions are reported for SLPA and
Copra.

B. Identifying Overlapping Nodes in Synthetic Networks 

Allowing overlapping nodes is the key feature of
overlapping communities. For this reason, the ability to
identify overlapping nodes is an essential component for
quantifying the quality of a detection algorithm. However,
the node level evaluation is often neglected in previous work.

Note that the number of overlapping nodes alone is 
not sufficient to quantify the detection performance. To 
provide more precise analysis, we define the identification 
of overlapping nodes as a binary classification problem. We 
use F-score as a measure of accuracy, which is the harmonic 
mean of precision (i.e., the number of overlapping nodes
detected correctly divided by the total number of detected
overlapping nodes) and recall (i.e., the number of overlapping
nodes discovered correctly divided by the expected number of
overlapping nodes, 500 here).
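The node-level evaluation above can be made concrete with a short Python sketch. The helper names overlapping_nodes and overlap_f_score and the representation of a cover as a list of node sets are assumptions of this example, not part of the original evaluation code.

```python
from collections import Counter


def overlapping_nodes(cover):
    """Return the nodes that belong to more than one community in a cover."""
    counts = Counter(node for community in cover for node in set(community))
    return {node for node, c in counts.items() if c > 1}


def overlap_f_score(detected, true_overlapping):
    """F-score of overlapping-node identification as binary classification."""
    detected = set(detected)
    true_overlapping = set(true_overlapping)
    correct = len(detected & true_overlapping)
    if correct == 0:
        return 0.0  # also covers the case of empty inputs
    precision = correct / len(detected)        # correct / all detected
    recall = correct / len(true_overlapping)   # correct / all expected
    return 2 * precision * recall / (precision + recall)


# Toy usage: three detected communities, four true overlapping nodes.
cover = [{1, 2, 3}, {3, 4}, {4, 5, 1}]
detected = overlapping_nodes(cover)            # nodes 1, 3 and 4
score = overlap_f_score(detected, {3, 4, 6, 7})
```

In the toy usage, two of the three detected nodes are correct and two of the four true overlapping nodes are found, so precision is 2/3, recall is 1/2 and the F-score is 4/7.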

Figs. 3 and 4 show the F-score as a function of the
number of memberships. SLPA achieves the largest F-score
in networks with different levels of mixture, as defined
by μ. CFinder and Copra have close performance in the
test. Interestingly, SLPA has a positive correlation with
O_m while other algorithms typically demonstrate a negative
correlation. This is due to the high precision of SLPA when
each node may belong to many groups.

C. Identifying Overlapping Communities in Synthetic Networks

Most measures for quantifying the quality of a partition
are not suitable for a cover produced by overlapping
detection algorithms. We adopted the extended normalized
mutual information (NMI) proposed by Lancichinetti [4]. NMI
yields values between 0 and 1, with 1 corresponding
to a perfect matching. The best performances in terms of
NMI are shown in Fig. 5 and Fig. 8 for all algorithms with
optimal parameters.

The higher NMI of SLPA clearly shows that it outperforms
the other algorithms over different network structures
(i.e., with different values of μ). Comparing the number of
detected communities and the average number of detected
memberships with the ground truth in the benchmark helps to
understand the results. As shown in Figs. 6, 7, 9 and 10,
both quantities reported by SLPA are closer to the ground
truth than those reported by other algorithms. This is even
the case for μ = 0.3 and large O_m. The decrease in NMI is
also relatively slow, indicating that SLPA is less sensitive to
the diversity O_m. In contrast, Copra drops fastest with the
growth of O_m, even though it is better than CFinder and
LFM on average.

D. Identifying Overlapping Communities in Real-world Social Networks

To evaluate the performance of overlapping community
detection in real-world networks, we used an overlapping
measure, Q_ov, proposed by Nicosia [19]. It is an extension
of Newman's modularity [20]. For the function f required by
Q_ov, we adopted the one used in [7], f(x) = 60x - 30. Q_ov
values vary between 0 and 1. The larger the value, the better
the performance.

In this test, SLPA uses r in the range from 0.02 to 0.45.
Other algorithms use the same parameters as before. In
Table I, r, v and k are the parameters of the corresponding
algorithms. LFM used α = 1.0. For SLPA and Copra, the
algorithms were run 100 times, and the average and
standard deviation (std) of Q_ov were recorded.

As shown in Table I, SLPA achieves the highest Q_ov in
almost all the test networks, except the jazz network, for
which SLPA's result is marginally smaller (by 0.01) than that
of Copra. SLPA outperforms Copra significantly (by > 0.1)
on the karate, celegans and email networks. On average, LFM
and CFinder perform worse than either SLPA or Copra.

To give a better understanding of the output from the
detection algorithms, we show (in Table II) the statistics,
including the number of detected communities (denoted as
Com#), the number of detected overlapping nodes (i.e., O_n)
and the average number of detected memberships (i.e., O_m).
Due to the space limitation, we present only results from
SLPA (Columns 2 to 4) and CFinder (Columns 5 to 7).

It is interesting that all algorithms confirm that the
diversity of overlapping nodes in the tested social networks
is small (close to 2), although the number of overlapping
nodes differs from algorithm to algorithm. SLPA seems
to have a stricter concept of overlap and returns a smaller
number of overlapping nodes than CFinder. We observed
that the numbers of communities detected by SLPA are in
good agreement with the results from other non-overlapping
community detection algorithms. This is consistent with the
fact that the overlapping degree is relatively low in these
networks.

V. Conclusions 

In this paper, we present a dynamic interaction process
and one of its implementations, SLPA, to allow efficient
and effective overlapping community detection. This process
can be easily modified to accommodate different rules (i.e.,
speaking rule, listening rule, memory update, stop criterion
and post-processing) and different types of networks (e.g.,
k-partite graphs). Interesting future research directions
include fuzzy hierarchy detection and temporal community
detection.



Table II 

The statistics of the output from SLPA and CFinder. 